Observability, logging, and debugging in .NET desktop systems
This topic sounds boring until the day a machine fails in front of a customer, the UI looks “stuck,” the inspection workflow stops halfway, and nobody can reproduce it in development.
That is the moment when observability stops being a nice engineering idea and becomes the thing that separates a professional system from a fragile one.
In industrial desktop systems, observability is not just “write some logs.” It is how you understand what the software believed was happening, what the machine was doing, what the operator clicked, which background task failed, and why the workflow ended up in the wrong state.
PART 1 — BIG PICTURE
Why observability is critical in real systems
In a normal business app, a bug may mean a broken page or a failed request.
In a wafer inspection desktop system, a bug may mean:
- the machine stops mid-run
- an inspection result is partially saved
- a camera timeout causes a retry storm
- the UI shows “Running” while the machine is actually in fault state
- operators lose confidence because the system behaves unpredictably
These systems are hard because many things happen at the same time:
- UI events
- machine communication
- background processing
- image/result pipelines
- persistence
- hardware callbacks
- workflow orchestration
When something goes wrong, you need to answer questions like:
- What step was the workflow in?
- Which command was sent to the machine?
- Did the machine acknowledge it?
- Did a timeout happen before or after the response arrived?
- Did the UI reflect the real state or only some stale state?
- Was the failure complete, or did only one subsystem fail?
Without observability, you are guessing.
With observability, you can reconstruct the story.
Why debugging production issues is much harder than development
In development, life is friendly:
- you have the debugger
- you can step through code
- timing is slower and cleaner
- the environment is controlled
- hardware simulation may be stable
- logs are easy to inspect locally
In production or in the field, life is very different:
- the machine is real
- timing is different
- the user clicks in unpredictable ways
- network/device latency varies
- vendor SDKs behave differently under load
- failures are intermittent
- attaching a debugger may be impossible or too risky
The hardest bugs are usually not logic bugs. They are behavior bugs: timing, ordering, and interaction problems that only show up at runtime.
Examples:
- “It only happens once every 3 days.”
- “It happens only when the operator stops and quickly starts again.”
- “It fails only under real throughput.”
- “The machine replied, but the app acted like it timed out.”
- “The screen froze, but then recovered.”
Those problems are rarely solved by staring at code alone. They are solved by reconstructing real runtime behavior.
Why logs are often the only source of truth in field failures
In field failures, the logs are often the only witness that was actually present.
A customer-reported bug usually comes with poor input:
- “The app froze”
- “Inspection failed”
- “The machine disconnected”
- “The wrong result appeared”
- “It worked yesterday”
That is not enough.
A good logging system turns vague complaints into something actionable:
- run 20260321-104455 started with recipe RCP-12A on machine M-03
- wafer loaded successfully
- autofocus command sent
- machine response delayed 8.2 seconds
- timeout threshold 5 seconds exceeded
- workflow entered Recovery state
- image pipeline still processing last frame
- UI received stale status event after recovery
- background task faulted due to disposed channel writer
Now you have a real story.
That is observability: making invisible runtime behavior visible.
PART 2 — HOW IT ACTUALLY WORKS
Structured logging
A lot of teams say they have logging, but what they really have is text dumping.
Bad log:
    Inspection failed for machine 3 with recipe abc
Looks okay at first. But later, you want to search:
- all failures for a specific run
- all failures for machine M-03
- all failures for recipe ABC-2026
- all warnings before a specific error
- average duration of autofocus step across runs
Plain text makes this hard.
Structured logging means you log a message plus named fields.
Conceptual example:
Message:
    Inspection step failed
Properties:
    RunId = "RUN-20260321-001"
    MachineId = "M-03"
    Recipe = "ABC-2026"
    Step = "AutoFocus"
    DurationMs = 8200
    ErrorCode = "Timeout"
Now your log backend, or even local file analysis, can filter and group by these fields.
This is a huge difference in production systems. You stop reading logs like novels and start querying them like data.
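To make "querying logs like data" concrete, here is a minimal sketch that uses only the .NET base class library. The `LogEvent` record and the `LogQuery` helpers are hypothetical names invented for illustration; a real system would normally rely on a logging library and its backend rather than hand-rolled types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hypothetical event shape: one message plus named fields.
public sealed record LogEvent(
    DateTimeOffset Timestamp,
    string Level,
    string Message,
    IReadOnlyDictionary<string, object> Properties);

public static class LogQuery
{
    // One event per line as JSON, a common on-disk format for structured logs.
    public static string ToJsonLine(LogEvent e) => JsonSerializer.Serialize(e);

    // "All failures for machine M-03" becomes a field comparison, not a text search.
    public static IEnumerable<LogEvent> FailuresForMachine(
        IEnumerable<LogEvent> events, string machineId) =>
        events.Where(e =>
            e.Level == "Error" &&
            e.Properties.TryGetValue("MachineId", out var id) &&
            object.Equals(id, machineId));

    // "Average duration of a step across runs" becomes an aggregate over a numeric field.
    public static double AverageDurationMs(IEnumerable<LogEvent> events, string step) =>
        events
            .Where(e => e.Properties.TryGetValue("Step", out var s) && object.Equals(s, step))
            .Where(e => e.Properties.ContainsKey("DurationMs"))
            .Select(e => Convert.ToDouble(e.Properties["DurationMs"]))
            .DefaultIfEmpty(0.0)
            .Average();
}
```

Neither query is possible with the plain-text "machine 3 with recipe abc" line without fragile string parsing.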
Log levels
Log levels are not just decoration. They are a signal of operational importance.
Information
Used for important normal events.
Examples:
- inspection run started
- recipe loaded
- machine connected
- workflow state changed
- run completed
Information logs show the normal flow of the system.
Warning
Used when something is off, but the system can still continue.
Examples:
- machine response slower than normal
- retry triggered
- stale event ignored
- optional result save failed but run continues
- fallback path used
Warnings are important because they often explain why an eventual error happened later.
Error
Used for real failures that break an operation or require attention.
Examples:
- inspection aborted
- machine command failed
- unhandled background task exception
- database write failed for required result data
Errors should be meaningful, not noisy.
Debug / Trace
Used for very detailed internal behavior.
Examples:
- every message received from device protocol
- queue depth changes
- every retry attempt
- state machine transition checks
- timing between pipeline stages
Useful when diagnosing deep issues, but dangerous if always enabled at high volume.
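Levels only pay off when they can be filtered. The sketch below shows one way to set a baseline level and override it per category with Microsoft.Extensions.Logging. The category names `InspectionApp.Machine.Protocol` and `InspectionApp.VendorSdk` are assumptions made up for this example, and no provider is registered, so this factory on its own writes nothing.

```csharp
using Microsoft.Extensions.Logging;

public static class LoggingSetup
{
    public static ILoggerFactory CreateLoggerFactory() =>
        LoggerFactory.Create(builder =>
        {
            // Always-on baseline: lifecycle events, warnings, and errors.
            builder.SetMinimumLevel(LogLevel.Information);

            // Hypothetical category: protocol chatter switched on for an investigation.
            builder.AddFilter("InspectionApp.Machine.Protocol", LogLevel.Debug);

            // Hypothetical category: a chatty vendor wrapper held to warnings and above.
            builder.AddFilter("InspectionApp.VendorSdk", LogLevel.Warning);

            // A provider (file, console, ...) would be registered here; none is assumed.
        });
}
```

In a deployed application these filters usually live in configuration rather than code, so that Debug output can be enabled on a customer machine without a rebuild.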
Correlation of events across components
In a desktop machine-control system, one user action often triggers work in many layers:
- UI command
- workflow service
- machine controller
- camera service
- result processor
- file persistence
- event bus
- background worker
If each component logs independently with no shared context, the logs become useless.
You need correlation.
For one inspection run, every relevant log should carry shared identifiers such as:
- RunId
- MachineId
- LotId
- WaferId
- Recipe
- sometimes SessionId or OperationId
That lets you reconstruct one logical story even though the work spans many classes, tasks, and threads.
Without correlation, your logs are just fragments.
With correlation, they become a timeline.
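One way shared identifiers can reach every layer without being passed as parameters is an ambient context that flows with the async call chain. The sketch below uses `AsyncLocal<T>` from the base class library. `RunContext` and `CorrelationDemo` are hypothetical names, and the `Log` method is only a stand-in for a real logger.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical ambient context: the current RunId flows with the async call chain.
public static class RunContext
{
    private static readonly AsyncLocal<string?> CurrentRunId = new();

    public static string? RunId => CurrentRunId.Value;

    // Sets the RunId for the current async flow; Dispose restores the previous value.
    public static IDisposable Begin(string runId)
    {
        var previous = CurrentRunId.Value;
        CurrentRunId.Value = runId;
        return new Restore(previous);
    }

    private sealed class Restore : IDisposable
    {
        private readonly string? _previous;
        public Restore(string? previous) => _previous = previous;
        public void Dispose() => CurrentRunId.Value = _previous;
    }
}

public static class CorrelationDemo
{
    public static readonly ConcurrentQueue<string> Lines = new();

    // Stand-in for a real logger: every line is stamped with the ambient RunId.
    private static void Log(string message) =>
        Lines.Enqueue($"RunId={RunContext.RunId ?? "none"} {message}");

    // A deeper layer that never receives the RunId as a parameter.
    private static async Task CaptureAsync()
    {
        await Task.Delay(10);
        Log("Image captured");
    }

    public static async Task ExecuteRunAsync(string runId)
    {
        using (RunContext.Begin(runId))
        {
            Log("Run started");
            await CaptureAsync();
            Log("Run finished");
        }
    }
}
```

Two runs started concurrently each stamp their own lines, because each async flow sees its own value. Logging scopes, covered in Part 4, give the same effect through the logging API itself.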
PART 3 — REAL PROBLEMS IN THIS SYSTEM
The running example for this part: a WPF desktop app controlling a wafer inspection machine.
Tracing an inspection run from start to finish
A real inspection run is not one method call. It is a distributed conversation inside one process.
Typical flow:
- Operator selects recipe
- UI requests run start
- workflow validates readiness
- machine moves to load position
- wafer loads
- autofocus starts
- image acquisition begins
- defect pipeline processes frames
- results save incrementally
- summary generated
- workflow completes
If the user says, “Run failed halfway,” you need to know exactly where halfway was.
Good logging lets you see:
- when the run started
- which state transitions occurred
- which hardware commands were issued
- what the machine returned
- how long each step took
- where the first abnormal event appeared
That means you should log the major lifecycle:
- run created
- state transitions
- machine command send/response
- retries
- step duration
- completion/abort reason
Not every tiny internal method. The important story.
Diagnosing machine communication issues
Hardware integration bugs are painful because the software and machine each blame the other.
Typical problems:
- command sent but no reply
- reply arrived late
- malformed reply
- duplicate response
- disconnect during operation
- SDK callback on unexpected thread
- command acknowledged but machine never changed state
To diagnose this, logs need more than “communication failed.”
You need:
- command name
- machine/device id
- sequence or request id if available
- timeout threshold
- actual wait duration
- raw error code from SDK/protocol
- connection state before and after
- whether retry was attempted
For example, there is a big difference between:
- no response ever arrived
- response arrived after timeout
- response arrived but parser failed
- command succeeded but state poller still saw old state
These sound similar to the operator, but they have very different root causes.
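Telling these cases apart requires remembering a command after its timeout has fired. The following is a minimal sketch, assuming the protocol carries a request id on every reply. `CommandTracker`, `ReplyOutcome`, and `ReplyInfo` are hypothetical names, not part of any SDK.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;

public enum ReplyOutcome { OnTime, LateAfterTimeout, Duplicate, Unknown }

public sealed record ReplyInfo(ReplyOutcome Outcome, string? Command, TimeSpan Waited);

// Hypothetical tracker: the transport calls Sent for each command and OnReply for each reply.
public sealed class CommandTracker
{
    private sealed class Entry
    {
        public Entry(string command, TimeSpan timeout, TimeSpan sentAt)
        {
            Command = command; Timeout = timeout; SentAt = sentAt;
        }
        public string Command { get; }
        public TimeSpan Timeout { get; }
        public TimeSpan SentAt { get; }
        public bool Answered { get; set; }
    }

    private readonly ConcurrentDictionary<int, Entry> _entries = new();
    private readonly Func<TimeSpan> _clock;

    // The clock is injectable so the classification can be tested without real delays.
    public CommandTracker(Func<TimeSpan>? clock = null)
    {
        var stopwatch = Stopwatch.StartNew();
        _clock = clock ?? (() => stopwatch.Elapsed);
    }

    public void Sent(int requestId, string command, TimeSpan timeout) =>
        _entries[requestId] = new Entry(command, timeout, _clock());

    // Entries are kept after the timeout on purpose, so a late reply can still be recognized.
    // A real tracker would also expire old entries; that is omitted here.
    public ReplyInfo OnReply(int requestId)
    {
        if (!_entries.TryGetValue(requestId, out var entry))
            return new ReplyInfo(ReplyOutcome.Unknown, null, TimeSpan.Zero);

        lock (entry)
        {
            var waited = _clock() - entry.SentAt;
            if (entry.Answered)
                return new ReplyInfo(ReplyOutcome.Duplicate, entry.Command, waited);

            entry.Answered = true;
            var outcome = waited > entry.Timeout
                ? ReplyOutcome.LateAfterTimeout
                : ReplyOutcome.OnTime;
            return new ReplyInfo(outcome, entry.Command, waited);
        }
    }
}
```

Logging the returned outcome together with the command name and wait duration turns "communication failed" into one of the distinct cases listed above.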
Debugging race conditions or timing bugs
Timing bugs are the worst kind because the system often “looks correct” in code review.
Examples:
- stop command races with machine-complete event
- UI binds to stale view model state
- background consumer processes an old frame after run cancellation
- reconnect logic overlaps with active command execution
- event order differs under load
Often the only way to understand this is timeline logging.
You need timestamps and context around:
- event received
- state transition requested
- state transition applied
- cancellation requested
- task completed
- queue item dequeued
- UI updated
Then you can see the ordering.
For example:
    10:15:01.102 Run canceled
    10:15:01.110 Image frame received
    10:15:01.114 Frame processing started
    10:15:01.130 Result publish skipped because run is canceled
This tells a healthy story.
But if you instead see:
    10:15:01.102 Run canceled
    10:15:01.130 Result publish completed
then you know canceled work still leaked through.
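The healthy timeline only happens if the publish step re-checks the run state at the last moment and logs its decision. A minimal sketch of that guard follows. `ResultPublisher` and `FrameResult` are hypothetical, and the `Action<string>` logger is a stand-in that keeps the example free of package dependencies.

```csharp
using System;
using System.Collections.Generic;

public sealed record FrameResult(string RunId, int FrameIndex);

// Hypothetical publisher: results are accepted only for the run that is currently active.
public sealed class ResultPublisher
{
    private readonly object _gate = new();
    private readonly Action<string> _log;
    private string? _activeRunId;

    public ResultPublisher(Action<string> log) => _log = log;

    public List<FrameResult> Published { get; } = new();

    public void BeginRun(string runId)
    {
        lock (_gate) { _activeRunId = runId; }
    }

    public void CancelRun(string runId)
    {
        lock (_gate)
        {
            if (_activeRunId == runId) _activeRunId = null;
            // Logged inside the lock so the log order matches the real order of events.
            _log($"Run canceled. RunId={runId}");
        }
    }

    // The check and the publish share one lock, so a cancel cannot slip in between them.
    public bool TryPublish(FrameResult result)
    {
        lock (_gate)
        {
            if (_activeRunId != result.RunId)
            {
                _log($"Result publish skipped because run is not active. RunId={result.RunId}, Frame={result.FrameIndex}");
                return false;
            }

            Published.Add(result);
            _log($"Result published. RunId={result.RunId}, Frame={result.FrameIndex}");
            return true;
        }
    }
}
```

The skipped-publish line is as important as the guard itself: it is the evidence that late work was seen and rejected rather than silently lost.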
Understanding partial failures during workflows
Real systems often fail partially, not completely.
Examples:
- inspection completed, but thumbnail save failed
- machine moved correctly, but UI never updated
- main result saved, but defect overlay generation failed
- live stream disconnected, but inspection continued
- summary report failed, but raw data exists
If your logs only record final success/failure, you lose the nuance.
A mature system logs per sub-operation and records degradation clearly.
That matters because the recovery action depends on what failed.
- If only visualization failed, do not re-run the wafer.
- If raw images are missing, re-run may be required.
- If save completed but UI showed failure, the operator may need reassurance, not retry.
- If report generation failed after successful inspection, treat it as post-processing failure, not machine failure.
This is why workflow-level observability matters. It helps separate process failure from subsystem failure.
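One way to keep that nuance is to record an outcome per sub-operation and derive the overall result from them. The sketch below assumes each sub-operation can be marked as required or optional. `RunReport`, `StepResult`, and `RunOutcome` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public enum RunOutcome { Succeeded, Degraded, Failed }

public sealed record StepResult(string Step, bool Required, Exception? Error)
{
    public bool Ok => Error is null;
}

// Hypothetical per-run report: one entry per sub-operation, success or not.
public sealed class RunReport
{
    private readonly List<StepResult> _steps = new();

    public IReadOnlyList<StepResult> Steps => _steps;

    // Runs one sub-operation and records its outcome instead of letting it abort the run.
    // A real workflow would usually stop after a failed required step and treat
    // cancellation separately; both are left out to keep the sketch short.
    public async Task RunStepAsync(string step, bool required, Func<Task> action)
    {
        try
        {
            await action();
            _steps.Add(new StepResult(step, required, null));
        }
        catch (Exception ex)
        {
            _steps.Add(new StepResult(step, required, ex));
        }
    }

    // Failed: a required step failed. Degraded: only optional steps failed.
    public RunOutcome Outcome =>
        _steps.Any(s => !s.Ok && s.Required) ? RunOutcome.Failed
        : _steps.Any(s => !s.Ok) ? RunOutcome.Degraded
        : RunOutcome.Succeeded;
}
```

With a report like this, "inspection completed, but thumbnail save failed" is logged as a Degraded run with one failed optional step, and nobody re-runs the wafer by mistake.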
PART 4 — HOW WE USE IT IN .NET (PRACTICAL)
Below are practical patterns using Microsoft.Extensions.Logging.
Structured logging with context
    using System;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Extensions.Logging;

    // IMachineController is the application's own hardware abstraction (not shown).
    public sealed class InspectionService
    {
        private readonly ILogger<InspectionService> _logger;
        private readonly IMachineController _machineController;

        public InspectionService(
            ILogger<InspectionService> logger,
            IMachineController machineController)
        {
            _logger = logger;
            _machineController = machineController;
        }

        public async Task StartInspectionAsync(
            string runId,
            string machineId,
            string recipe,
            CancellationToken cancellationToken)
        {
            // Named placeholders ({RunId}, ...) become structured fields, not just text.
            _logger.LogInformation(
                "Inspection run started. RunId={RunId}, MachineId={MachineId}, Recipe={Recipe}",
                runId, machineId, recipe);

            try
            {
                await _machineController.LoadRecipeAsync(machineId, recipe, cancellationToken);
                _logger.LogInformation(
                    "Recipe loaded successfully. RunId={RunId}, MachineId={MachineId}, Recipe={Recipe}",
                    runId, machineId, recipe);

                await _machineController.StartInspectionAsync(machineId, cancellationToken);
                _logger.LogInformation(
                    "Inspection command accepted. RunId={RunId}, MachineId={MachineId}",
                    runId, machineId);
            }
            catch (OperationCanceledException)
            {
                // Cancellation is expected behavior, so it is a warning, not an error.
                _logger.LogWarning(
                    "Inspection run canceled. RunId={RunId}, MachineId={MachineId}, Recipe={Recipe}",
                    runId, machineId, recipe);
                throw;
            }
            catch (Exception ex)
            {
                _logger.LogError(
                    ex,
                    "Inspection run failed during startup. RunId={RunId}, MachineId={MachineId}, Recipe={Recipe}",
                    runId, machineId, recipe);
                throw;
            }
        }
    }

The important thing here is not the syntax. It is that the log carries reusable context.
Using scopes for correlated logs
Scopes are very useful in .NET logging when many downstream calls should automatically inherit the same context.
    using System.Collections.Generic;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Extensions.Logging;

    // InspectionWorkflow is the application's own workflow class (not shown).
    public sealed class InspectionRunCoordinator
    {
        private readonly ILogger<InspectionRunCoordinator> _logger;
        private readonly InspectionWorkflow _workflow;

        public InspectionRunCoordinator(
            ILogger<InspectionRunCoordinator> logger,
            InspectionWorkflow workflow)
        {
            _logger = logger;
            _workflow = workflow;
        }

        public async Task ExecuteRunAsync(
            string runId,
            string machineId,
            string recipe,
            CancellationToken cancellationToken)
        {
            // The scope stays active across awaits until the method exits.
            using var scope = _logger.BeginScope(new Dictionary<string, object>
            {
                ["RunId"] = runId,
                ["MachineId"] = machineId,
                ["Recipe"] = recipe
            });

            _logger.LogInformation("Inspection workflow execution started.");

            await _workflow.PrepareAsync(cancellationToken);
            await _workflow.RunAsync(cancellationToken);
            await _workflow.CompleteAsync(cancellationToken);

            _logger.LogInformation("Inspection workflow execution finished successfully.");
        }
    }

Now every log written inside that scope can inherit the contextual properties, provided the logging provider is configured to capture scopes.
This is one of the best ways to keep correlation consistent.
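Whether scope data actually appears in the output depends on the provider. As a small illustration, the built-in simple console formatter omits scopes unless it is told to include them. The sketch below assumes the Microsoft.Extensions.Logging.Console package is referenced. The console is used only because it is the simplest provider to show; a desktop application would more likely write to a file. `ScopeAwareLogging` is a hypothetical helper name.

```csharp
using Microsoft.Extensions.Logging;

public static class ScopeAwareLogging
{
    public static ILoggerFactory Create() =>
        LoggerFactory.Create(builder =>
            builder.AddSimpleConsole(options =>
            {
                // Without this flag the simple console formatter drops scope data.
                options.IncludeScopes = true;

                // Millisecond timestamps make event ordering visible.
                options.TimestampFormat = "HH:mm:ss.fff ";
            }));
}
```

Whatever provider is used, it is worth checking its scope setting early: a scope that is created but never rendered gives no correlation at all.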
Logging important state transitions
State transitions are some of the most valuable logs in industrial systems.
    public enum InspectionState
    {
        Idle,
        Preparing,
        Running,
        Completing,
        Completed,
        Error,
        Aborted
    }

    public sealed class InspectionStateMachine
    {
        private readonly ILogger<InspectionStateMachine> _logger;

        public InspectionStateMachine(ILogger<InspectionStateMachine> logger)
        {
            _logger = logger;
        }

        public InspectionState CurrentState { get; private set; } = InspectionState.Idle;

        public void TransitionTo(
            InspectionState newState,
            string runId,
            string reason)
        {
            var oldState = CurrentState;
            CurrentState = newState;

            _logger.LogInformation(
                "Inspection state changed. RunId={RunId}, OldState={OldState}, NewState={NewState}, Reason={Reason}",
                runId, oldState, newState, reason);
        }
    }

This kind of log is gold during incident analysis.
When the customer says, “It got stuck,” this tells you where it got stuck.
Capturing errors with enough context
A very common mistake is logging an exception without the operation context.
Bad:
    _logger.LogError(ex, "Save failed");
Better:
    _logger.LogError(
        ex,
        "Failed to save inspection result. RunId={RunId}, WaferId={WaferId}, ResultPath={ResultPath}, Step={Step}",
        runId,
        waferId,
        resultPath,
        "PersistFinalResult");
Now the log tells you:
- what failed
- for which run
- for which wafer
- at which step
- against which output path
That is the minimum needed to investigate.
Logging async and background operations correctly
Desktop systems often have background loops for polling, streaming, processing, and health monitoring.
These loops are dangerous because failures can become invisible.
    // IMachineGateway is the application's own device abstraction (not shown).
    public sealed class MachineStatusMonitor
    {
        private readonly ILogger<MachineStatusMonitor> _logger;
        private readonly IMachineGateway _machineGateway;

        public MachineStatusMonitor(
            ILogger<MachineStatusMonitor> logger,
            IMachineGateway machineGateway)
        {
            _logger = logger;
            _machineGateway = machineGateway;
        }

        public async Task RunAsync(string machineId, CancellationToken cancellationToken)
        {
            using var scope = _logger.BeginScope(new Dictionary<string, object>
            {
                ["MachineId"] = machineId
            });

            _logger.LogInformation("Machine status monitor started.");

            while (!cancellationToken.IsCancellationRequested)
            {
                try
                {
                    var status = await _machineGateway.GetStatusAsync(machineId, cancellationToken);
                    _logger.LogDebug(
                        "Machine status polled. Status={Status}, IsConnected={IsConnected}",
                        status.State,
                        status.IsConnected);

                    await Task.Delay(TimeSpan.FromMilliseconds(500), cancellationToken);
                }
                // Only treat this as shutdown if our own token was canceled. A timeout
                // inside the gateway can also surface as OperationCanceledException and
                // must not stop the monitor silently.
                catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
                {
                    _logger.LogInformation("Machine status monitor stopping due to cancellation.");
                    break;
                }
                catch (Exception ex)
                {
                    _logger.LogError(ex, "Unhandled exception in machine status monitor loop.");

                    // Optional backoff to avoid hot failure loops
                    try
                    {
                        await Task.Delay(TimeSpan.FromSeconds(2), cancellationToken);
                    }
                    catch (OperationCanceledException)
                    {
                        break;
                    }
                }
            }

            _logger.LogInformation("Machine status monitor stopped.");
        }
    }

Important lessons here:
- background loops should log start and stop
- exceptions must be caught inside the loop
- repeated failures should not create a CPU-burning retry storm
- cancellation should be logged differently from errors
Timing important operations
Latency is often the hidden cause of workflow issues.
    // Fragment: a method on a class that holds _logger and _machineGateway fields.
    public async Task SendCommandAsync(
        string machineId,
        string commandName,
        string runId,
        CancellationToken cancellationToken)
    {
        var startedAt = DateTime.UtcNow;
        var sw = System.Diagnostics.Stopwatch.StartNew();

        try
        {
            _logger.LogInformation(
                "Sending machine command. RunId={RunId}, MachineId={MachineId}, Command={Command}",
                runId, machineId, commandName);

            await _machineGateway.SendAsync(machineId, commandName, cancellationToken);

            sw.Stop();
            _logger.LogInformation(
                "Machine command completed. RunId={RunId}, MachineId={MachineId}, Command={Command}, DurationMs={DurationMs}",
                runId, machineId, commandName, sw.ElapsedMilliseconds);
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            // Cancellation is an expected outcome, so it is not logged as an error.
            sw.Stop();
            _logger.LogWarning(
                "Machine command canceled. RunId={RunId}, MachineId={MachineId}, Command={Command}, DurationMs={DurationMs}",
                runId, machineId, commandName, sw.ElapsedMilliseconds);
            throw;
        }
        catch (Exception ex)
        {
            sw.Stop();
            _logger.LogError(
                ex,
                "Machine command failed. RunId={RunId}, MachineId={MachineId}, Command={Command}, DurationMs={DurationMs}, StartedAtUtc={StartedAtUtc}",
                runId, machineId, commandName, sw.ElapsedMilliseconds, startedAt);
            throw;
        }
    }

Timing logs are extremely useful for diagnosing slowdowns before they become outright failures.
PART 5 — COMMON MISTAKES (VERY REALISTIC)
Logging too little
This is the classic failure.
Symptoms:
- only final error is logged
- no state transitions
- no operation ids
- no step durations
- no context about input or machine state
Production consequence:
You know something failed, but you cannot explain why. Engineers end up guessing, adding temporary logs, and waiting for the bug to happen again.
This is expensive and embarrassing in front of customers.
Logging too much
The opposite problem is also real.
Symptoms:
- every method entry/exit logged
- every UI binding event logged
- every loop iteration at Info level
- huge raw payload dumps for every message
- thousands of repetitive logs during normal operation
Production consequence:
- storage cost grows
- important signals drown in noise
- incident analysis becomes slower, not faster
- log viewers become almost unusable
- performance may degrade under load
A noisy system is not an observable system. It is just a loud system.
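A common way to tame repetitive messages without losing them entirely is to let one message per key through per interval and count the rest. The sketch below is one possible shape for such a throttle. `LogThrottle` is a hypothetical helper, not a library type, and the injectable clock exists only to make it testable.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: allows one message per key per interval and counts the suppressed ones.
public sealed class LogThrottle
{
    private readonly TimeSpan _interval;
    private readonly Func<DateTimeOffset> _now;
    private readonly object _gate = new();
    private readonly Dictionary<string, (DateTimeOffset LastEmitted, int Suppressed)> _state = new();

    public LogThrottle(TimeSpan interval, Func<DateTimeOffset>? now = null)
    {
        _interval = interval;
        _now = now ?? (() => DateTimeOffset.UtcNow);
    }

    // Returns true when the caller should log. suppressedSinceLast reports how many
    // messages with the same key were dropped since the previous emitted one.
    public bool ShouldLog(string key, out int suppressedSinceLast)
    {
        lock (_gate)
        {
            var now = _now();
            if (_state.TryGetValue(key, out var s) && now - s.LastEmitted < _interval)
            {
                _state[key] = (s.LastEmitted, s.Suppressed + 1);
                suppressedSinceLast = 0;
                return false;
            }

            suppressedSinceLast = s.Suppressed;
            _state[key] = (now, 0);
            return true;
        }
    }
}
```

A caller would check `ShouldLog("CameraTimeout", out var suppressed)` before logging the warning and include `suppressed` as a field, so the log still shows that the condition repeated and how often.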
Missing context
This is deadly.
You may have hundreds of logs, but none say:
- which run?
- which machine?
- which wafer?
- which recipe?
- which workflow step?
Production consequence:
You cannot reconstruct one incident from concurrent activity.
In industrial apps, many things may be happening at once. Without context, all failures blur together.
Logging only errors without flow
Many teams only log when something goes wrong.
That sounds reasonable, but it breaks debugging.
Why?
Because the error log tells you what failed, but not what led to it.
Example error:
    Timeout while waiting for AutoFocus
Useful, but incomplete.
You also want to know:
- was the recipe just switched?
- had the machine recently reconnected?
- did a slow warning happen 20 seconds earlier?
- was the workflow already in recovery mode?
- had cancellation already been requested?
Production consequence:
You see the crash, not the path to the crash.
Ignoring background task failures
This is one of the most dangerous desktop-system mistakes.
Examples:
- fire-and-forget task throws and nobody observes it
- polling loop dies silently
- channel consumer exits and pipeline quietly stops
- retry worker crashes and never restarts
From the operator’s view, the system becomes “weird”:
- data stops updating
- UI still looks alive
- machine state no longer refreshes
- workflows hang waiting for signals that no worker is processing
Production consequence:
Silent data loss, stale UI, stuck workflows, and terrifying nondeterministic behavior.
In real systems, background work must be supervised, and failures must be surfaced loudly.
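A minimal sketch of what "supervised" can mean in code: every background worker is started through one helper that guarantees a fault is reported. `Supervisor` and its method names are hypothetical. The `onFault` callback stands in for logging at Error level and raising a health alarm, and restart policy is deliberately left out.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class Supervisor
{
    // Runs a background worker and guarantees that a fault is reported, never swallowed.
    public static Task RunSupervised(
        string name,
        Func<CancellationToken, Task> worker,
        Action<string, Exception> onFault,
        CancellationToken cancellationToken)
    {
        return Task.Run(async () =>
        {
            try
            {
                await worker(cancellationToken);
            }
            catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
            {
                // Expected shutdown path, not a failure.
            }
            catch (Exception ex)
            {
                // Stand-in for: log at Error level and surface the failure to the operator.
                onFault(name, ex);
            }
        });
    }

    // Last line of defense for faulted tasks that nobody awaited. The runtime raises
    // this event only when such a task is garbage collected, so it can fire late.
    public static void HookUnobservedExceptions(Action<string, Exception> onFault)
    {
        TaskScheduler.UnobservedTaskException += (_, e) =>
        {
            onFault("UnobservedTask", e.Exception);
            e.SetObserved();
        };
    }
}
```

The design choice is to make the unsupervised path impossible by convention: if no code calls `Task.Run` directly for long-lived work, no worker can die without leaving a log line.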
PART 6 — PERFORMANCE & TRADE-OFFS
Logging overhead
Logging is not free.
Costs include:
- string formatting
- allocation of objects/properties
- serialization cost for structured fields
- disk or network I/O
- lock contention in some sinks
- pressure on CPU and memory under heavy volume
In high-throughput systems, careless logging can become part of the performance problem.
Examples:
- logging every frame in an image pipeline at Info level
- logging every status poll at high frequency
- writing logs synchronously to disk from hot paths
- logging large payloads or raw image metadata constantly
Synchronous vs asynchronous logging
Synchronous logging is simpler, but riskier for performance-sensitive operations.
If a hot path waits for disk I/O or slow log sink flushing, you introduce latency into the production flow.
That is bad in machine-control paths.
Asynchronous logging reduces the direct impact on the caller, but introduces trade-offs:
- logs may be delayed
- buffered logs may be lost on crash if not flushed
- queue overflow strategies matter
- diagnosing shutdown issues gets harder
In practice, many production systems use async/buffered sinks for throughput, but are careful to flush on shutdown and keep critical failure paths reliable.
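These trade-offs are easier to see in a small example. The sketch below is a buffered writer built on `System.Threading.Channels`, with an explicit overflow policy and a flush on shutdown. `BufferedLogWriter` is a hypothetical class for illustration; production logging libraries ship their own, more complete, asynchronous sinks.

```csharp
using System;
using System.IO;
using System.Threading.Channels;
using System.Threading.Tasks;

public sealed class BufferedLogWriter : IAsyncDisposable
{
    private readonly Channel<string> _queue;
    private readonly Task _pump;

    public BufferedLogWriter(TextWriter sink, int capacity = 4096)
    {
        _queue = Channel.CreateBounded<string>(new BoundedChannelOptions(capacity)
        {
            // Overflow policy: never block the caller; the oldest buffered line is dropped.
            FullMode = BoundedChannelFullMode.DropOldest,
            SingleReader = true
        });

        // A single background consumer performs all I/O.
        _pump = Task.Run(async () =>
        {
            await foreach (var line in _queue.Reader.ReadAllAsync())
            {
                await sink.WriteLineAsync(line);
            }
            await sink.FlushAsync();
        });
    }

    // Called from hot paths: enqueue only, no I/O on the caller's thread.
    public void Write(string line) => _queue.Writer.TryWrite(line);

    // Shutdown: stop accepting lines, then wait until everything buffered is written.
    public async ValueTask DisposeAsync()
    {
        _queue.Writer.TryComplete();
        await _pump;
    }
}
```

Each trade-off from the list is visible here: lines are delayed until the pump runs, anything still queued is lost if the process dies before `DisposeAsync`, and `DropOldest` is a deliberate choice that favors the caller's latency over log completeness.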
Balancing detail vs performance
This is senior-level judgment.
You do not want to log everything. You want to log the things that explain behavior.
Good candidates for always-on Info logs:
- run started/completed/aborted
- state transitions
- machine connect/disconnect
- command send/complete/fail
- workflow step start/finish/fail
- critical retries and recoveries
Good candidates for Debug logs:
- detailed protocol chatter
- queue depth changes
- polling details
- fine-grained timing
- verbose SDK callback traces
The practical pattern is:
- keep high-value lifecycle logs always on
- keep high-volume diagnostics available but controlled by level/configuration
- avoid expensive payload logging in hot loops unless temporarily enabled for incident analysis
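For the last point, the usual guard in Microsoft.Extensions.Logging is `ILogger.IsEnabled`, checked before building an expensive argument. A short sketch follows. `FrameDiagnostics` and the hex dump of a frame header are assumptions chosen for illustration, and `Convert.ToHexString` requires .NET 5 or later.

```csharp
using System;
using Microsoft.Extensions.Logging;

public sealed class FrameDiagnostics
{
    private readonly ILogger<FrameDiagnostics> _logger;

    public FrameDiagnostics(ILogger<FrameDiagnostics> logger) => _logger = logger;

    public void OnFrame(int frameIndex, ReadOnlySpan<byte> header)
    {
        // Cheap argument: the message is only formatted when Debug is enabled,
        // although the argument is still boxed on every call.
        _logger.LogDebug("Frame received. FrameIndex={FrameIndex}", frameIndex);

        // Expensive argument: build the hex dump only when Trace output is enabled.
        if (_logger.IsEnabled(LogLevel.Trace))
        {
            _logger.LogTrace(
                "Frame header dump. FrameIndex={FrameIndex}, HeaderHex={HeaderHex}",
                frameIndex, Convert.ToHexString(header));
        }
    }
}
```

In normal operation the Trace branch costs one level check per frame. During an incident, raising the level for this category turns the dump on without a code change.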
PART 7 — SENIOR ENGINEER THINKING
How experienced engineers design logging strategy
A senior engineer does not treat logging as an afterthought. They design it as part of system behavior.
They ask:
- What failures will happen in the field?
- What will support or engineers need to know?
- Which workflows need end-to-end traceability?
- Which identifiers must be present on every log?
- Which background processes can fail silently?
- What should be visible at Info vs Debug?
- How will we diagnose timing issues?
That means logging is designed around real operational questions, not random LogInformation calls.
What to log vs what not to log
Log:
- lifecycle events
- state transitions
- external commands and outcomes
- retries, fallbacks, timeouts
- workflow boundaries
- background worker start/stop/failure
- operation durations
- degraded modes and partial failures
Usually do not log:
- every trivial method call
- repetitive noise with no diagnostic value
- huge objects or payloads by default
- sensitive data
- high-frequency events at high severity
The question is always:
Will this help explain system behavior later?
If yes, it is probably worth logging. If not, it is probably noise.
How to make logs actionable for debugging
Actionable logs answer real engineering questions.
A useful log usually contains:
- what happened
- where it happened
- which operation/run it belongs to
- which entity was involved
- what the system was trying to do
- whether it succeeded, failed, retried, or degraded
- how long it took
- exception details if relevant
Bad log:
    Error in workflow
Actionable log:
    Failed to transition workflow step from AutoFocus to Capture after machine timeout.
    RunId=RUN-20260321-001 MachineId=M-03 Recipe=ABC-2026 DurationMs=5102 RetryCount=2
That gives engineers something to work with.
How to design systems that are diagnosable under pressure
This is the real mark of maturity.
When production is on fire, nobody wants clever architecture that cannot explain itself.
Diagnosable systems have these traits:
- explicit states instead of random booleans
- clear workflow boundaries
- correlated logs across components
- supervised background tasks
- meaningful error classification
- timing visibility
- recoveries and retries logged as first-class events
- enough information to reconstruct a timeline
A senior engineer thinks beyond “Does it work?” They think: “When it fails at 2 AM on a customer machine, can we understand it fast?”
That mindset changes architecture.
You start building systems that expose their own behavior instead of hiding it.
Final takeaway
In industrial .NET desktop systems, observability is not just about logs. It is about making runtime truth visible.
A production-grade WPF machine-control system is full of concurrency, timing sensitivity, hardware uncertainty, and long-running workflows. When issues happen, the debugger is usually gone. The code is no longer enough. What matters is the evidence the system left behind.
Good logging gives you that evidence.
Not too little. Not too much. Just enough structured, correlated, high-value information to reconstruct what really happened.
That is how senior engineers design systems that can survive real production pressure.